library(tidyverse)
library(nycflights13)
Use geom_jitter to show a better representation of the data.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
The arguments to geom_jitter that control the amount of jittering are width and height.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 9, height = 0)
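As a minimal sketch of how width and height behave, the same jitter can be made reproducible by passing position_jitter() to geom_point(). The values 0.3 and the seed 42 are arbitrary choices for illustration, not values from the exercise.

```r
library(ggplot2)

# position_jitter() accepts the same width and height arguments as
# geom_jitter(), plus a seed so the random offsets repeat between runs.
# width = 0.3, height = 0.3, and seed = 42 are illustrative assumptions.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 42))

p_jitter
```

Small values keep each point near its true location, while a large width such as 9 spreads points far along the x-axis.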
geom_count sizes each point according to the number of observations at that location. Both geom_jitter and geom_count are ways to better view overlapping points. With geom_count, though, a large point in a dense area can cover neighboring points, so it is not always the better choice.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
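To see the numbers behind the point sizes, the overlaps can also be tabulated directly. This is only a sketch using dplyr's count(); the name overlaps is made up for the example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# One row per distinct (cty, hwy) location, with n = number of cars there.
# sort = TRUE puts the most heavily overplotted locations first.
overlaps <- count(mpg, cty, hwy, sort = TRUE)

head(overlaps)
```

Locations with a large n are the ones that geom_count draws as the biggest points.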
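The default position adjustment for geom_boxplot is a dodge (recent ggplot2 versions use position = "dodge2"), so setting position = "dodge" explicitly gives essentially the same plot here as leaving it out.
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = fl, y = cty), position = "dodge")
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = fl, y = cty))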
labs() lets you modify the title, axis, legend, and other plot labels, so you can rename them.
Without coord_fixed(), one unit on the x-axis is not drawn at the same length as one unit on the y-axis, which can skew how people interpret the data: it looks like city mpg increases faster than highway mpg.
geom_abline adds a line defined by the slope and intercept you supply. By default, it uses intercept = 0 and slope = 1. You can use it as a reference line when reading the rest of the plot.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline()
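The pieces above can be combined in one plot. This is a sketch only: the label text and the dashed line type are illustrative choices, and the slope and intercept are written out even though they match the defaults.

```r
library(ggplot2)

# Explicit slope/intercept (same as the defaults) make the reference line
# self-documenting; coord_fixed() draws x and y units at the same length.
# The title and axis labels are made up for this example.
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  coord_fixed() +
  labs(
    title = "Highway vs. city fuel economy",
    x = "City mpg",
    y = "Highway mpg"
  )

p_ref

# Share of cars that sit above the reference line (hwy greater than cty).
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```

Points above the dashed line are cars whose highway mileage exceeds their city mileage.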
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8)
filter(diamonds, carat > 3)
filter(flights, arr_delay >= 120)
filter(flights, dest %in% c("IAH", "HOU"))
Look at airlines to get the list of airlines with their respective abbreviations.
airlines
filter(flights, carrier %in% c("UA", "AA", "DL"))
filter(flights, month %in% c(7, 8, 9))
filter(flights, dep_delay <= 0 & arr_delay > 120)
filter(flights, dep_delay >= 60 & arr_delay < dep_delay - 30)
filter(flights, between(dep_time, 0, 600))
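A quick check of how between() treats its endpoints, using a small made-up vector rather than the flights data:

```r
library(dplyr)

# Hypothetical values chosen to sit on and around the two boundaries.
x <- c(-1, 0, 1, 600, 601)

# between() is shorthand for x >= left & x <= right, so both ends count.
in_range <- between(x, 0, 600)
in_range
# FALSE  TRUE  TRUE  TRUE FALSE

same_as_manual <- identical(in_range, x >= 0 & x <= 600)
same_as_manual
# TRUE
```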
between() gives us the values that lie between its two arguments, inclusive of both ends. (I had already used it for 1.7; I guess I was supposed to use >= 0 & <= 600 there.)
Some flights are missing dep_time. Other variables that are missing in those rows are dep_delay, arr_time, arr_delay, air_time, and occasionally tailnum. These might represent canceled flights.
filter(flights, is.na(dep_time))
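The NA expressions discussed next can be checked at the console with base R alone. The variable names are only there so the results can be inspected afterwards.

```r
# An expression involving NA is only non-missing when every possible
# value of the NA would give the same answer.
na_pow  <- NA^0         # 1: any number to the power 0 is 1
na_or   <- NA | TRUE    # TRUE: one TRUE is enough for OR
na_and  <- FALSE & NA   # FALSE: one FALSE is enough for AND
na_zero <- NA * 0       # NA: the answer depends on the missing value...
inf_zero <- Inf * 0     # NaN: ...because not every number times 0 is 0

na_pow; na_or; na_and; na_zero; inf_zero
```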
NA ^ 0 is not missing because anything raised to the zero power equals 1. NA | TRUE is not missing because the OR operator only needs one argument to be TRUE, and one of them is. FALSE & NA is not missing because, for the AND operator, as soon as one argument is FALSE the whole expression is FALSE.
The tricky case is NA * 0, which gives NA. I originally figured the rule was that an expression is not missing whenever it has the same result regardless of what you replace NA with, and that anything multiplied by 0 should equal 0. The rule still works once you note that not every value times 0 is 0: Inf * 0 is NaN in R, so the result does depend on the missing value.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
I had considered using contains() but doing this would end up grabbing sched_dep_time and/or sched_arr_time too. This same problem would arise from grabbing all the columns from dep_time to arr_delay.
We could also use the column numbers.
select(flights, c(4, 6, 7, 9))
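Another option, sketched here as an alternative to the approaches above, is a regular expression with matches(). Anchoring the pattern at the start of the name is what keeps sched_dep_time and sched_arr_time out; the name delay_cols is made up for the example.

```r
library(dplyr)
library(nycflights13)

# ^ and $ force the whole column name to be dep/arr + _ + time/delay,
# so the sched_* columns do not match.
delay_cols <- select(flights, matches("^(dep|arr)_(time|delay)$"))

names(delay_cols)
```

Unlike column numbers, this keeps working if the column order changes.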
select(flights, dep_time, dep_delay, dep_time, dep_time)
one_of(), when used with select, lets you select all the columns whose names are in a character vector. So in the example, you won’t have to type all the column names again, and you’ll get the year, month, day, dep_delay, and arr_delay columns.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
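In current dplyr, one_of() is superseded by all_of() and any_of(), which do the same job with a clearer contract about missing names. A minimal sketch, where "not_a_column" is a deliberately nonexistent name added for illustration:

```r
library(dplyr)
library(nycflights13)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# all_of() is strict: it errors if any name in vars is not a column.
strict <- select(flights, all_of(vars))

# any_of() is lenient: names that do not exist are silently skipped.
# "not_a_column" is a hypothetical name used only to show this.
lenient <- select(flights, any_of(c(vars, "not_a_column")))

names(strict)
names(lenient)
```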
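contains() ignores case by default, so the first call below matches every column with "time" in its name. To make the match case-sensitive, set ignore.case = FALSE.
select(flights, contains("TIME"))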
select(flights, contains("TIME", ignore.case = FALSE))
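Comparing the column names returned by the two calls makes the difference concrete. This is a sketch; the names insensitive and sensitive are made up for the example.

```r
library(dplyr)
library(nycflights13)

# Default: case is ignored, so "TIME" matches lower-case "time".
insensitive <- names(select(flights, contains("TIME")))

# Case-sensitive: no column name in flights contains upper-case "TIME".
sensitive <- names(select(flights, contains("TIME", ignore.case = FALSE)))

insensitive
length(sensitive)
```

The case-sensitive call returns a tibble with no columns, because every column name in flights is lower case.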